US EPA Air Quality


Approach

  1. Keep only sensors that cover > 84% of dates [2019 - 2020].
  2. Fill in missing median_temp from sensor latitude using a linear regression model.

Dictionary



Key Columns


Great! No missing values in the sensor-related data.

sensor_id, sensor_lat, and sensor_long each have the same 525 unique values, so every sensor maps to a single location. This means we can group by 'sensor_id' and aggregate with the median without changing the values of sensor_lat/sensor_long.
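
A minimal sketch of that check and aggregation, on toy data. The values are made up and the real notebook's frame will have more columns. Since coordinates are constant within a sensor, the median passes them through untouched.

```python
import pandas as pd

# Toy stand-in for the EPA readings (values are made up for illustration).
df = pd.DataFrame({
    "sensor_id":   ["a", "a", "b", "b", "b"],
    "sensor_lat":  [34.1, 34.1, 40.7, 40.7, 40.7],
    "sensor_long": [-118.2, -118.2, -74.0, -74.0, -74.0],
    "median_temp": [18.0, 20.0, 9.0, None, 11.0],
})

# Each sensor should map to exactly one (lat, long) pair.
coords_per_sensor = df.groupby("sensor_id")[["sensor_lat", "sensor_long"]].nunique()
assert (coords_per_sensor == 1).all().all()

# Coordinates are constant within a sensor, so the median returns them unchanged;
# only the measurement columns are actually aggregated (NaN is skipped).
per_sensor = df.groupby("sensor_id").median(numeric_only=True)
print(per_sensor)
```

If the `nunique` assertion fails on the real data, some sensor has more than one recorded location and the median would silently blend them.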


CBSA fips Dictionary



Add fips to EPA Data
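
One way this join could look, as a sketch only. The key column `cbsa_code` and both toy tables are hypothetical, and the notebook's actual join key may differ. The one firm point is that county FIPS codes are five characters (two-digit state plus three-digit county), so they are safer stored as zero-padded strings.

```python
import pandas as pd

# Hypothetical column names and toy rows; the real shared key may differ.
epa = pd.DataFrame({"sensor_id": ["a", "b"], "cbsa_code": [31080, 35620]})
cbsa_fips = pd.DataFrame({"cbsa_code": [31080, 35620], "fips": [6037, 36061]})

# Store FIPS as 5-character strings so leading zeros survive (e.g. "06037").
cbsa_fips["fips"] = cbsa_fips["fips"].astype(str).str.zfill(5)

# Left join keeps every EPA row; validate raises if the lookup has duplicate keys.
epa_fips = epa.merge(cbsa_fips, on="cbsa_code", how="left", validate="many_to_one")
print(epa_fips)
```

Rows left with a missing `fips` after the merge would indicate codes absent from the dictionary.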



County fips



Unique sensor list



Choropleth Map - All 525 Sensors



Filter Sensors (525 -> 262)
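
The > 84% date-coverage rule from the Approach can be sketched as follows. This is an illustration on a toy 10-day window rather than 2019 - 2020, and the `date` column name is an assumption.

```python
import pandas as pd

COVERAGE_THRESHOLD = 0.84  # keep sensors reporting on more than 84% of dates

# Toy data: sensor "a" reports every day, sensor "b" on half of the days.
all_dates = pd.date_range("2019-01-01", periods=10, freq="D")
df = pd.DataFrame({
    "sensor_id": ["a"] * 10 + ["b"] * 5,
    "date": list(all_dates) + list(all_dates[:5]),  # "date" is an assumed name
})

# Fraction of distinct dates in the window on which each sensor reported.
coverage = df.groupby("sensor_id")["date"].nunique() / len(all_dates)

# Keep only rows from sensors above the threshold.
keep_ids = coverage[coverage > COVERAGE_THRESHOLD].index
filtered = df[df["sensor_id"].isin(keep_ids)]
print(coverage)
```

Counting distinct dates rather than rows avoids overstating coverage when a sensor has several readings on the same day.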


Data Coverage:

We have fair coverage of sensor measurements:

Most communities have:

EJ communities are mostly defined by:

Bivariate Analysis


Pair Plot


Observations:

Correlations:


Heatmap


Observations:

Let's perform PCA to identify


Linear Regression (fill in missing Temp)


There is a nearly one-to-one inverse linear relationship (slope ≈ -1) between sensor_lat and median_temp.

This linear model explains ~80% of the variance in the training data: y = 52.66 - 1.01 x, where x is sensor_lat and y is median_temp.
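
A minimal sketch of the fill step, assuming a per-sensor table with `sensor_lat` and `median_temp`. It uses `np.polyfit` as a stand-in for whatever regression class the notebook used, fits on all rows with a known temperature (no train/test split), and runs on made-up values.

```python
import numpy as np
import pandas as pd

# Toy per-sensor table; values are made up to lie exactly on y = 50 - x.
sensors = pd.DataFrame({
    "sensor_lat":  [30.0, 35.0, 40.0, 45.0, 38.0],
    "median_temp": [20.0, 15.0, 10.0, 5.0, np.nan],
})

known = sensors["median_temp"].notna()
x = sensors.loc[known, "sensor_lat"]
y = sensors.loc[known, "median_temp"]

# Degree-1 least-squares fit: median_temp ≈ intercept + slope * sensor_lat.
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 on the rows used for fitting (share of variance explained).
residuals = y - (intercept + slope * x)
r2 = 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Fill only the missing temperatures; observed values are left as they are.
sensors.loc[~known, "median_temp"] = intercept + slope * sensors.loc[~known, "sensor_lat"]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.2f}")
```

Filling only the missing rows keeps real measurements intact, so the regression affects just the gaps.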


Bivariate Analysis



fips Map



Choropleth Map - filtered Sensors



Identify which sensor belongs to EJ communities


Across the continental US:

Within California/New York: